Dynamic Programming: Policy Evaluation#
Some Background#
Dynamic programming refers to the set of methods that iteratively approximate the Bellman Equations to estimate the values of states or state-action pairs. The Bellman Equations relate the value of future states to the value of the current state, or the value of future state-action pairs to the value of the current state-action pair. They are solvable in the ideal situation where we know the dynamics of the environment, namely the probability of transitioning to each next state and receiving each reward, given the current state and action.
Mathematically, we want to estimate \(v_\pi(s)\) which is the value at state \(s\) given we follow a policy \(\pi(a|s)\) which gives us the probability of selecting action \(a\) given the agent is in state \(s\).
To solve this, we first look at the Bellman Equations, which give us the relationship:
\(v_\pi(s) = f(\pi(a|s), r, \gamma, \text{p}(s',r|s,a))\) which means \(v_\pi\) is a function of:
our policy \(\pi(a|s)\)
immediate next reward \(r\)
discount rate \(\gamma\), where we discount future rewards by this factor between [0, 1]
transition probabilities \(\text{p}(s',r|s,a)\)
Now, solving this directly becomes quite complicated especially with many states \(s\), because effectively this becomes a system of \(|S|\) unknowns \(v_\pi(s)\) and equations where \(|S|\) is the number of possible discrete states. Note that in this formulation, we assume that states, actions, and rewards are discrete.
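For a fixed policy, this system of equations is linear and can in principle be solved directly. As a concrete (and entirely hypothetical) illustration, here is a tiny 3-state chain where we stack the expected rewards \(r_\pi\) and the state-to-state transition probabilities \(P_\pi\) induced by the policy, and solve \(v = r_\pi + \gamma P_\pi v\) in closed form:

```python
import numpy as np

# Hypothetical 3-state chain (not from the post): state 2 is an
# absorbing terminal state with reward 0; the other states cost -1.
# P[s, s'] = probability of landing in s' under the policy;
# r[s] = expected immediate reward when acting from state s.
gamma = 0.99
P = np.array([
    [0.5, 0.5, 0.0],   # from state 0
    [0.0, 0.5, 0.5],   # from state 1
    [0.0, 0.0, 1.0],   # terminal state self-loops
])
r = np.array([-1.0, -1.0, 0.0])

# v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(3) - gamma * P, r)
```

Since \(\gamma < 1\), the matrix \(I - \gamma P_\pi\) is invertible and the solution is exact, but as noted above this direct approach scales poorly: the solve costs \(O(|S|^3)\), which is why the iterative method below is preferred for large state spaces.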
A more practical method for estimating \(v_\pi(s)\) is dynamic programming, which iteratively refines the estimates using the update rule:
\(v_{k+1}(s) \leftarrow \sum_a \pi(a|s) \sum_{s',r} \text{p}(s',r|s,a)\left[r + \gamma v_k(s')\right]\)
This means we use the previous estimate \(v_{k}(s)\) at iteration \(k\) to compute a refined estimate \(v_{k+1}(s)\). In the limit as \(k \rightarrow \infty\), we expect
\(v_{k+1}(s) = v_{k}(s) = v_\pi(s)\)
for any arbitrary policy \(\pi(a|s)\).
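The iterative scheme above can be sketched in a few lines. Reusing the same hypothetical 3-state chain (our own toy example, not the post's gridworld), each sweep applies the Bellman expectation backup to every state until the estimates stop changing:

```python
import numpy as np

# Same hypothetical 3-state chain as before: state 2 is terminal.
gamma = 0.99
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([-1.0, -1.0, 0.0])

v = np.zeros(3)                      # v_0: arbitrary initial estimate
for k in range(1000):
    v_new = r + gamma * P @ v        # Bellman backup for every state at once
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-10:                # stop once the sweep barely changes v
        break
```

The iterates converge to the same values the direct linear solve would give, which is exactly the \(v_{k+1}(s) = v_k(s) = v_\pi(s)\) fixed point described above.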
The Experiment#
We experiment on a simple gridworld problem. We create an \(N \times N\) grid where X cells are designated as “goal states.” The objective is for an agent, placed randomly on the grid, to reach any of these goal states as quickly as possible.
Each transition from one state to another yields a -1 reward. If the agent bumps the border/edges of the grid, it stays in place but incurs a -1 reward. The reward in any of the goal states is 0. We use a discount rate of 0.99.
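These dynamics are simple enough to sketch directly. The function below is a minimal illustration (names and the single goal cell are our own assumptions, not the post's actual code): moving off the grid leaves the agent in place but still costs -1, and goal states yield 0:

```python
N = 4                      # assumed grid size for illustration
GOALS = {(0, 0)}           # assumed goal cell for illustration
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward) under the gridworld dynamics."""
    if state in GOALS:
        return state, 0.0          # goal states yield 0 reward
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N and 0 <= c < N):
        return state, -1.0         # bumped the border: stay, still pay -1
    return (r, c), -1.0
```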
We experiment with three \(\epsilon\)-greedy policies: Greedy (\(\epsilon = 0\)), Epsilon-Greedy (\(\epsilon = 0.1\)), and Random (\(\epsilon = 1\)). Greedy always moves to the cell with the largest estimated value \(v_k(s)\), breaking ties randomly. Epsilon-Greedy does the same, except there is an \(\epsilon\) chance at each timestep that it picks an action at random. Random simply chooses any action with equal probability.
We wanted to see how these three policies explore the environment to estimate the state values \(v_k(s)\) using dynamic programming.
If the agent reaches any of the goal states, it resets to a random non-goal state. However, its knowledge of the state values is kept, so it can continue to refine the estimates.
We visualize in the animation below the movement of the Greedy, Epsilon-Greedy, and Random agents, along with the corresponding estimated state values \(v_k(s)\). We find that Random is slowest to estimate the state values, which is interesting because we initially expected Random to cover more of the grid, since it is not drawn toward the goals.
The goal is indicated by the yellow cell. In terms of the value estimates, redder means more negative while bluer means less negative or more positive.
Note that each frame in the animation below corresponds to 5 timesteps in the simulation.
We ran our Greedy agent for 500,000 timesteps to get our “ground truth” \(v_\pi(s)\) estimate. Note that running our Random agent instead still converges to the same values, which we have validated.
We then compare the rates at which our agents reach the ground truth values by computing the mean squared error between the estimate and the ground truth over many timesteps.
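The comparison metric itself is straightforward. A small sketch (the arrays here are toy numbers, not the experiment's actual values), where both value estimates are flattened grids over the non-goal cells:

```python
import numpy as np

def mse(v_estimate, v_true):
    """Mean squared error between a value estimate and the ground truth."""
    return float(np.mean((v_estimate - v_true) ** 2))
```

Tracking this quantity after every sweep gives the error curves discussed below.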
We find that while Greedy leads to lower errors initially, there comes a point where Random yields better estimates of \(v_\pi(s)\), as indicated by lower MSE. We hypothesize that this is because Greedy and Epsilon-Greedy become biased toward moving to the goal, which reduces their chance of visiting certain areas and thus their opportunities to refine the estimates there. Random has no such bias, so it continues to refine its estimates across the whole grid.